Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

127 ingest more bands for l8 products #161

Conversation

ricardogsilva
Copy link
Contributor

@ricardogsilva ricardogsilva commented Sep 1, 2017

This PR implements the changes proposed in issue #127

@ricardogsilva
Copy link
Contributor Author

ricardogsilva commented Sep 4, 2017

@randomorder
I've been tackling first subtask:

Search and Download should already support multiple bands. Test it and improve it to allow configurable list of bands

This turned out to be a lot more work than originally planned. I had to do some changes in order to get the code to run on my dev box. Additionally, the current implementation makes it very hard to test tasks in isolation. My struggle so far has been not with the complexity of the task, but with the setup of a productive environment.

Regarding the testing part of the task:

  • Just getting airflow to run with the provided Dockerfile is messy - on a related note, please see issues Replace secrets.py python modules with environment variables #159, Invalid config for S2_Download DAG #160 and airflow.cfg on the repo does not match airflow.cfg on the published docker image #162;
  • The code is also somewhat hard to read. Apparently we are not enforcing any coding style. IMHO we should. Python has pep8 and lots of IDEs and dedicated tools have support for it. This makes the code a lot more readable. Personally, I struggle with understanding code that uses super long lines and very few white space.
  • I've done some other refactoring trying to improve readability. Things like moving variables closer to where they are used, cutting down repeated code, etc
  • Testing the download operator is also hard because the current code expects there to be a postgresql database with the scene list. This meant I had to first get the scene list DAG to run. And create a postgresql database locally. And create the correct table. This turned out to be a big overhead too.

I'm trying to make the point that it is hard for a new dev to pick up this code. Perhaps we should invest some time in easing this process?

Back to the subtask at hand, regarding the multiple band support, support for multiple bands was indeed already implemented in the download operator.
However, it seems to me that it is more useful to be able to select exactly which bands one wants to download instead of assuming a sequence. I mean, we might want bands 1, 3 and 7, for example. The previous code assumed that we'd always want bands 1, 2, 3, 4, 5, 6, 7, always a perfect sequence. From personal experience, this is usually not the case with satellite imagery. I've changed the code in order to allow downloading a list of individual named bands.

I think I'll be able to move faster now that I have a more manageable dev environment up.

@ricardogsilva ricardogsilva force-pushed the 127-ingest-more-bands-for-l8-products branch from a500bcb to a0fc6f4 Compare September 4, 2017 14:33
@ricardogsilva
Copy link
Contributor Author

ricardogsilva commented Sep 5, 2017

@randomorder

I'm continuing the work. Moved on to second subtask:

Improve the GDALTranslate operator to have XCom pull parameters as parameters to the operator

Here I've actually deviated a bit from your instructions, hence I need some feedback on the proposed changes.
Reading the airflow docs on XComs and the API reference, I've decided to stop using the explicit task_instance.xcom_push() and rely on the operator's return value instead. This makes the code cleaner since every operator is now returning a proper result, instead of just returning True.
In addition to this, I've added the attribute get_inputs_from to the operators. This serves as a way to indicate what task(s) are upstream and need to be xcom_pulled. Since I'm using operator's return as a way to pass the xcoms around there is no need to specify a key in the task_instance.xcom_pull() method anymore. This also makes the code simpler.

You can find examples of this refactored workflow in the plugins/gdal_plugin.py:GDALTranslateOperator and also in the plugins/search_download_daraa_plugin.py modules. I plan to follow suit and apply this method of inter task communication to the remaining operators of the DAG I'm working with - that is if you agree.

I've also implemented a generate_dag() function in order to make the search_download_daraa.py module able to cope with multiple AOIs. This means that it will be possible to have a DAG for each individual AOI, each with its own bands too. I'm not happy about the current way to specify each band - it is being defined directly on the DAG generating module. I'd prefer to pass this as an option in airflow.cfg. The airflow.cfg file is a nice place to store custom configuration values because we can use airflow's configuration module in order to retrieve them. A pattern like:

# airflow.cfg
[my_dag]
config1=something
config2=other

# elsewhere, where we need the config
from airflow import configuration
value1 = configuration.get("my_dag", "config1")

is nice in order to retrieve custom config. I've not been doing this for now mainly because of issue #162.

I've also tried to simplify the parameters in the existing operators, throwing away unused stuff (because of YAGNI) in order to make the code simpler and more readable.

This is still very much a work in progress and I'd appreciate any feedback that you may have ;)

@randomorder
Copy link
Member

Back to the subtask at hand, regarding the multiple band support, support for multiple bands was indeed already implemented in the download operator.
However, it seems to me that it is more useful to be able to select exactly which bands one wants to download instead of assuming a sequence. I mean, we might want bands 1, 3 and 7, for example. The previous code assumed that we'd always want bands 1, 2, 3, 4, 5, 6, 7, always a perfect sequence. From personal experience, this is usually not the case with satellite imagery. I've changed the code in order to allow downloading a list of individual named bands.

Ok, sounds good

I'm continuing the work. Moved on to second subtask:

Improve the GDALTranslate operator to have XCom pull parameters as parameters to the operator

Here I've actually deviated a bit from your instructions, hence I need some feedback on the proposed changes.
Reading the airflow docs on XComs and the API reference, I've decided to stop using the explicit task_instance.xcom_push() and rely on the operator's return value instead. This makes the code cleaner since every operator is now returning a proper result, instead of just returning True.
In addition to this, I've added the attribute get_inputs_from to the operators. This serves as a way to indicate what task(s) are upstream and need to be xcom_pulled. Since I'm using operator's return as a way to pass the xcoms around there is no need to specify a key in the task_instance.xcom_pull() method anymore. This also makes the code simpler.

You can find examples of this refactored workflow in the plugins/gdal_plugin.py:GDALTranslateOperator and also in the plugins/search_download_daraa_plugin.py modules. I plan to follow suit and apply this method of inter task communication to the remaining operators of the DAG I'm working with - that is if you agree.

Generally speaking would be good to simplify XCom code ( #132 ). Please comment on #132 so that we can track this approach

I've also implemented a generate_dag() function in order to make the search_download_daraa.py module able to cope with multiple AOIs. This means that it will be possible to have a DAG for each individual AOI, each with its own bands too. I'm not happy about the current way to specify each band - it is being defined directly on the DAG generating module.

Sounds good to me

airflow.cfg

[my_dag]
config1=something
config2=other

elsewhere, where we need the config

from airflow import configuration
value1 = configuration.get("my_dag", "config1")
is nice in order to retrieve custom config. I've not been doing this for now mainly because of issue #162.

I've also tried to simplify the parameters in the existing operators, throwing away unused stuff (because of YAGNI) in order to make the code simpler and more readable.

Please take a look at #118
Config keys management has already been discussed and a potential refactor is already in progress.

@ricardogsilva
Copy link
Contributor Author

@randomorder

Moving on to third subtask:

Improve the "landsat8_translate_daraa_task" and "landsat8_addo_daraa_task" to support multiple bands.

I've decided to support parallel tasks at the DAG level. As such I've actually refactored the download, translate and addo tasks in order to not support multiple bands, but only to support single files. The DAG has also been refactored in order to process these tasks in parallel. The result is looking something like:
parallel_tasks

Which seems to comply with the task's description. This DAG structure will only work in true parallel fashion if using an airflow executor other than the default SequentialExecutor and another database other than the default sqlite one.

I've not started to work on the metadata and thumbnail tasks, so their order of execution might be wrong in the picture.

Regarding the thumbnail, I am wondering why are we generating a thumbnail instead of just grabbing the one already suplied by the Landsat folks that is also availbale for downloading from s3?

@randomorder
Copy link
Member

I've decided to support parallel tasks at the DAG level. As such I've actually refactored the download, translate and addo tasks in order to not support multiple bands, but only to support single files. The DAG has also been refactored in order to process these tasks in parallel.

Have you changed GDAL Operators as well?

Regarding the thumbnail, I am wondering why are we generating a thumbnail instead of just grabbing the one already suplied by the Landsat folks that is also availbale for downloading from s3?

Yes, we are generating it. We want it to be square, compressed and low resolution. The one provided with the S3 did not match the requisites

@ricardogsilva ricardogsilva force-pushed the 127-ingest-more-bands-for-l8-products branch from eb8eb7b to 30c13e3 Compare September 6, 2017 22:05
@ricardogsilva
Copy link
Contributor Author

@randomorder

I've been working on the last subtask:

Improve product_json_task, product_thumbnail_task, product_thumbnail_task and product_description_task and product_zip_task as needed to add support multiple bands

I've done some extensive cleanup and refactoring, especially in the operator that reads MTL files, which was kind of messy before. Hopefully it is more readable now.

I think most of the stuff for this PR is nearing completion.

I've not been able to do much testing yet, mostly due to the way that the DAGs are currently implemented (custom operators + xcoms for inter task communication -> hard to run stuff in isolation).
I'll be performing some functional testing from now on and fixing the inevitable bugs that will surface.

Eventually I'll get back to you and ask for a proper review, when I think that the code is ready.

@simboss simboss added this to the Stakeholder Workshop milestone Sep 7, 2017
@ricardogsilva
Copy link
Contributor Author

ricardogsilva commented Sep 8, 2017

@randomorder

I've been testing my modifications and things seem to be working OK.

At the moment I'm only missing three tasks:

  1. Task that copies the html template

    I believe the code was already broken before I started working on it.

    It seems to want to copy the xml metadata template from a hardcoded directory of ./geo-solutions-work/evo-odas/metadata-ingestion/templates/product_abstract.html. I obviously do not have this path on my dev machine and neither do I seem to have the product_abstract.html file anywhere whithin the repository. What should I do? Leave it broken?

  2. Task that generates the thumbnail

    I'm getting a strange error:

    Traceback (most recent call last):
       File "/usr/local/bin/airflow", line 28, in <module>
         args.func(args)
       File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 422, in run
         pool=args.pool,
       File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 53, in wrapper
         result = func(*args, **kwargs)
       File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1384, in run
         result = task_copy.execute(context=context)
       File "/usr/local/evoodas/geosolutions/plugins/landsat8_metadata_plugin.py", line 218, in execute
         img = Image(downloaded_thumbnail)
     RuntimeError: Magick: Must specify image size (/tmp/gmjVuAUt) reported by coders/mvg.c:173 (ReadMVGImage)

    I haven't been able to figure it out yet, but haven't spent much time trying either. I'll try to debug and fix.

  3. Task that generates the zip

    Since this task is downstream from the others that have the error I haven't been able to test it correctly yet. This is an unfortunate constraint that arises from the fact that we're using XComs to pass values from one task to another - I cannot easily test tasks in isolation.

Still, here's another screenshot:
integration_tests1

At least I'm getting more greens than reds ;)

I guess you could start reviewing/testing the PR by now. I feel like it won't be long before it is done.

As for the other Landsat8 DAG, the one that updates the scen list, it seems to be working fine.

@randomorder
Copy link
Member

Task that copies the html template

I believe the code was already broken before I started working on it.

It seems to want to copy the xml metadata template from a hardcoded directory of ./geo-solutions-work/evo-odas/metadata-ingestion/templates/product_abstract.html. I obviously do not have this path on my dev machine and neither do I seem to have the product_abstract.html file anywhere whithin the repository. What should I do? Leave it broken?

File is available at: https://github.com/geosolutions-it/evo-odas/blob/master/metadata-ingestion/templates/product_abstract.html
Please use that one

Task that generates the zip

Since this task is downstream from the others that have the error I haven't been able to test it correctly yet. This is an unfortunate constraint that arises from the fact that we're using XComs to pass values from one task to another - I cannot easily test tasks in isolation.

I believed we already touched this. XCom mechanism is cumbersome and I don't like it either but that Airflow provides and what we are using. We can't ditch XComs until we have very good reason and a working replacement for it.

Task that generates the zip

Since this task is downstream from the others that have the error I haven't been able to test it correctly yet. This is an unfortunate constraint that arises from the fact that we're using XComs to pass values from one task to another - I cannot easily test tasks in isolation.

Still, here's another screenshot:
integration_tests1

At least I'm getting more greens than reds ;)

I guess you could start reviewing/testing the PR by now. I feel like it won't be long before it is done.

As for the other Landsat8 DAG, the one that updates the scen list, it seems to be working fine.

Diagram is looking good!
I'll start reviewing while you are working on the final fixes

this was causing the airflow/plugins/gdal_plugin.py file to
be ignored. Since this module is used by the Landsat8 DAG
it was probably not supposed to be ignored. This might fix partially
issue geosolutions-it#158.
All code pertaining to the Landsat8 DAGs has been refactored.
The changes are strictly cosmetic and do not alter the business logic
in any way. Changes were made to increase readability and to conform
with pythons official style guide, as described in pep8.

Changes:

- reordered imports
- lines capped at 79 chars
- replaced all tabs with spaces
- used 4 spaces as indentation consistently everywhere
- used double quotes for docstrings
- respected white space rules around operators and functions
- etc.
Removed redundant dicts and brought parameters closer
to where they are used
The code of the download operator has also been refactored for readability
the changes in this commit refactor both the dag and its custom operators.
they were necessary in order to be able to test tasks in isolation.
these changes drop xcoms in this dag.
Each designated Landsat8 area shall have its own DAG. The DAG
gets generated by the new generate_Dag function
- Switched to using XComs on the operator return values
- Removed some unused operator parameters in order to simplify the code
- use streamlined xcoms
- support single files only
- general refactoring
- paralelism is being dealt with at the DAG level
- download_dir and download_url moved to DEFAULT_ARGS
- refactor of the operators
@ricardogsilva ricardogsilva force-pushed the 127-ingest-more-bands-for-l8-products branch from 2588e50 to bab587c Compare September 12, 2017 16:56
@ricardogsilva
Copy link
Contributor Author

ricardogsilva commented Sep 12, 2017

@randomorder

The work seems to be done. Pending your favorable review I believe the PR could be merged.
I've already rebased this branch on top of master, but did not squash my commits. If you'd prefer me to do that let me know.

Some notes:

  • As stated in the commit message of b5ccf6d, the proposed implementation generates a DAG for each Landsat8 AOI. This is controlled, for now, by the AREAS variable in the search_download_daraa.py file. This is a list of Landsat8Area instances that specifies the characteristics of each area, including which bands to process;
  • DAGs are generated dynamically by the generate_dag function. This function adapts the design of the DAGs logical dependencies according to the number of bands that are to be processed;
  • I've refactored most of the custom operators that are used in the Landsat8 DAGs trying to make the code more readable and easier to maintain;
  • As stated in the commit message for 9fd2ca8 I've also refactored most of the code of the two Landsat8 DAGs in order to follow Python's standard style guide.

@ricardogsilva ricardogsilva changed the title [WIP]127 ingest more bands for l8 products 127 ingest more bands for l8 products Sep 12, 2017
@randomorder randomorder self-assigned this Sep 15, 2017
@randomorder randomorder removed this from the Stakeholder Workshop milestone Sep 15, 2017
@randomorder
Copy link
Member

Review is still in progress
There are changes to shared operators that may impact used other DAGs as well ( https://github.com/ricardogsilva/evo-odas/blob/bab587cce2d947c2c53eb444a4c6a743bb0b9118/airflow/plugins/gdal_plugin.py
)
For this reason I am postponing this to after the workshop

@simboss simboss requested a review from randomorder September 28, 2017 14:18
@randomorder randomorder merged commit bdb2c09 into geosolutions-it:master Oct 2, 2017
@aaime aaime removed the in progress label Oct 2, 2017
randomorder added a commit to randomorder/evo-odas that referenced this pull request Oct 4, 2017
…more-bands-for-l8-products

127 ingest more bands for l8 products
randomorder added a commit to randomorder/evo-odas that referenced this pull request Oct 4, 2017
…more-bands-for-l8-products

127 ingest more bands for l8 products
randomorder added a commit to randomorder/evo-odas that referenced this pull request Oct 4, 2017
…more-bands-for-l8-products

127 ingest more bands for l8 products
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants